Conventions

Convention: commands
Description: This typeface means that the command must be entered exactly as shown in the text and the
[Return] or [Enter] key pressed.

Convention: Screen displays
Description: This typeface represents information as it appears on the screen and is generally enclosed within
a bounding box.

Convention: [Key] names
Description: Key names appear in the text written with brackets, for example [Return] or [F7]. If it is
necessary to press more than one key simultaneously, the key names are linked with a plus (+) sign:
Press [Ctrl]+[Alt]+[Del].

Convention: Bold-face text
Description: An instruction, register or argument referred to in the main body text for purposes of
clarification. Selections made via the menu hierarchy of a software application.

Convention: Words in Italicized type
Description: Italics emphasize a point or concept, or denote new terms.



Disclaimer 

1. This document, together with the information contained in any and all associated ClearSpeed documents
[...] application notes and the like ("Information"), is provided in connection with ClearSpeed products
and is provided for information only. Quoted figures in the Information, which may be performance, size,
cost, power and the like, are estimates based upon analysis and simulations of current designs and are
liable to change.

2. The Information does not constitute an offer of, or an invitation by or on behalf of ClearSpeed, or any
ClearSpeed affiliate, to supply any [...] access to this Information. Except as provided in ClearSpeed
Terms and Conditions of Sale for ClearSpeed products, ClearSpeed assumes no liability whatsoever.

3. ClearSpeed products are not intended for use, whether directly or indirectly, in any medical, life saving
and/or life sustaining systems or applications.

4. [...] the Information and data contained therein is owned by ClearSpeed. No license, whether express or
implied [...] to any intellectual property rights is granted by this [...] or otherwise. You may not
download, copy, adapt or distribute this Information except with the consent in writing of ClearSpeed.

5. [...] responsible for any and all design, functionality and terms of sale of any product which
incorporates a ClearSpeed product [...] without limitation, product liability, intellectual property
infringement, warranty including conformance to specification [...].

6. Any condition, warranty or other term which might but for this paragraph have effect between ClearSpeed
and you, or which would otherwise be implied into the Information (including, without limitation, the
implied terms of satisfactory quality, merchantability or fitness for purpose), whether by statute, common
law or otherwise, are hereby excluded.

7. ClearSpeed reserves the right to make changes to the Information or the data contained therein at any time without notice.

Traffic Handler patent summary



1. Summary of proposals



The document includes proposals for the following inventions: 

Programmable 40 Gbits/s Traffic Handler - A traffic handler architecture in which packets are processed by
software and inserted into an orderlist for scheduling.

Packet storage system for traffic handling - A Memory Hub, used for buffering packets in a high line rate
Traffic Handler.

State Element - A smart memory cell for serialising accesses to shared state variables.

State Engine - A formal framework for designing an active state storage system using state elements.

Programmable orderlist manager - A system for maintaining ordered logical data structures in software at
high speeds.

Overlapped Virtual Queueing - A low overhead method for setting up and tearing down virtual queues.



Notes: 



• Figure and reference indexes apply locally within each chapter of this document 

• This is a summary document. There will be a lot of additional detail (functions and claims) relating to each
proposal which may not be covered in this document.

• In each proposal, additional related design work is also listed. These are ideas which are considered
to have potential either as sub-patents, or as independent proposals in their own right.

2. A programmable 40 Gbits/s Traffic Handler



2.1 Background

A basic knowledge of the function and anatomy of an internet router is assumed. 

A switch fabric can deliver packets from multiple ingress ports to one of a number of egress ports. The
egress [...] transmit these packets over some communication medium to the [...]. The rate of transmission
is normally limited to a standard rate. For instance, an OC-768 link would transmit packets over an optical
fibre at a rate of 40 Gbits/s.

With many independent ingress paths delivering packets for transmission at egress, the time averaged rate of
delivery cannot exceed 40 Gbits/s for this example. Although over time the input and output rates are
equivalent, the short term delivery of traffic by the fabric is bursty in nature, with rates often peaking
above the 40 Gbits/s threshold. Since the rate of receipt can be greater than the rate of transmission,
short term packet queueing is required at egress. A simple FIFO queue is adequate for this purpose [...].

More complex schemes are required in routers which provide Traffic Management. In a converged internetwork,
different end user applications require different grades of service in order to run effectively. Email can
be carried on a best effort basis where no guarantees are made regarding rate of, or delay in, delivery.
Real-time voice data has a much more demanding requirement for reserved transmission bandwidth and
guaranteed minimum delay in delivery. This cannot be achieved if all traffic is buffered in the same FIFO
queue. A queue per so-called Class of Service is required so that traffic routed through higher priority
queues can bypass that in lower priority queues. Certain queues may also be assured a guaranteed portion of
the available output line bandwidth. The ClearSpeed view of Traffic Handling in context is described in the
ClearSpeed Traffic Management system whitepaper [1].

2.2 The problem and prior art 

At first sight, the traffic handling task appears to be straightforward. Packets are placed in queues
according to their required class of service. For every forwarding treatment that a system provides, a queue
must be implemented. These queues are then managed by the following mechanisms:

• Queue management assigns buffer space to queues and prevents overflow.

• Congestion avoidance measures are implemented to cause traffic sources to slow their transmission rates if
queues become backlogged.

• Scheduling controls the dequeueing process by dividing the available output line bandwidth between the
queues.

Different service levels can be provided by weighting the amount of bandwidth and buffer space allocated to
different queues, and by prioritised packet dropping in times of congestion. Weighted Fair Queueing (WFQ),
Deficit Round Robin (DRR) scheduling and Weighted Random Early Detect (WRED) are just a few of the many
algorithms which might be employed to perform these scheduling and congestion avoidance tasks. See
reference [2] for a thorough description of these algorithms.
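As an illustration of the kind of scheduling algorithm named above, the following is a minimal sketch of
Deficit Round Robin in Python. It is not taken from the proposal; the function name, the queue contents and
the quantum values are assumptions chosen for illustration. Each queue earns a byte credit (its quantum) per
visit, and sends head-of-line packets while the credit covers them, so a larger quantum yields a larger
share of the output line.

```python
from collections import deque

def drr_schedule(queues, quantum, rounds):
    """Deficit Round Robin over `queues`, a list of deques of packet lengths (bytes).

    quantum[i] is the byte credit added to queue i on each visit.
    Returns the dequeue order as (queue_index, packet_length) tuples.
    """
    deficit = [0] * len(queues)
    sent = []
    for _ in range(rounds):
        for i, q in enumerate(queues):
            if not q:
                deficit[i] = 0          # an idle queue does not accumulate credit
                continue
            deficit[i] += quantum[i]
            # Send head-of-line packets while the accumulated credit covers them.
            while q and q[0] <= deficit[i]:
                length = q.popleft()
                deficit[i] -= length
                sent.append((i, length))
            if not q:
                deficit[i] = 0
    return sent

# Hypothetical traffic: short voice packets and long email packets.
voice = deque([200, 200, 200, 200])
email = deque([1500, 1500])
order = drr_schedule([voice, email], quantum=[400, 1500], rounds=2)
```

With these assumed values each round serves two voice packets and one email packet. The per-packet work is
small, but it is performed at dequeue time and serially, which is the cost the remainder of this chapter
sets out to avoid.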

In reality, system realisation is confounded by some difficult implementation issues: 

• High line rates can cause large packet backlogs to rapidly develop during brief congestion events. Large
memories of the order 500 MBytes to 1 GByte are required for 40 Gbits/s line rates.

• The packet arrival rate can be very high due to overspeed in the packet delivery from the switch fabric.
This demands high data bandwidth into memory. More importantly, high address bandwidth is also required.

• The processing overhead of some scheduling and congestion avoidance algorithms is high.

• Priority queue ordering for some (FQ) scheduling algorithms is a non-trivial problem at high speeds.

• A considerable volume of state must be maintained in support of scheduling and congestion avoidance 
algorithms, to which low latency access is required. The volume of state increases with the number of 
queues implemented. 

• As new standards and algorithms emerge, the specification is a moving target. To find a flexible (ideally 
programmable) solution is therefore a high priority. 

In a conventional approach to traffic scheduling, one might typically place packets directly into an
appropriate queue on arrival, and then subsequently dequeue packets from those queues into an output stream.
The traffic scheduler determines the order of dequeueing. Since the scheduling decision can be processing
intensive as the number of input queues increases, queues are often arranged into small groups which are
locally scheduled into an intermediate output queue. This output queue is then the input queue to a
following scheduling stage. The scheduling problem is thus simplified using a 'divide-and-conquer' approach
whereby high performance can be achieved through parallelism between groups of queues in a tree type
structure, or so-called hierarchical link sharing scheme [2].

This approach works in hardware up to a point. For the exceptionally large numbers of input queues (of the order 
64k) required for per-flow traffic handling, the first stage becomes unmanageably wide to a point that it becomes 
impractical to implement the required number of schedulers. 

Alternatively, in systems which aggregate all traffic into a small number of queues, parallelism between
hardware schedulers cannot be exploited. It then becomes extremely difficult to implement a single
scheduler, even in optimised hardware, that can meet the required performance point.

With other congestion avoidance and queue management tasks to perform in addition to scheduling, it is apparent 
that a new approach to traffic handling is required. 

2.3 Summary of the invention 

A traffic handler architecture in which packets are processed by software and inserted into an orderlist for sched- 
uling. 

2.4 List of attached figures 



Figure 2.1 Traffic Handler system functional overview showing principal components
(block labels: Processing system; Orderlist management system; Packet buffering system; Packet memory;
I/P; O/P; Packet record processing; Packet handling)

Figure 2.2 Functional overview of the system of MTAPs and other ASIC

Figure 2.3 Traffic Handling system implementation

2.5 Detailed description

Description of the concept and invention 

• There are no separate, physical input queues.

• Packets are effectively sorted directly into the output queue on arrival. A group of input queues thus
exists in the sense of being interleaved together within the single output queue.

• These interleaved 'input queues' are represented by state in the queue state engine. This state may track
queue occupancy, finish time/number of the last packet in the queue etc. Occupancy can be used to determine
whether or not a newly arrived packet should be placed in the output queue, or whether it should be dropped
(congestion management). Finish numbers are used to preserve the order of the 'input queues' within the
output queue and determine an appropriate position in the output queue for newly arrived packets
(scheduling).

• Scheduling and congestion avoidance decisions are thus made "on the fly" prior to enqueueing - a
technique referred to within ClearSpeed as "Think first queue later"™.

• This technique is made possible by the deployment of a high performance data flow processor which can
perform the required functions at wire speed. The ClearSpeed MTAP processor is ideal for this purpose,
providing a large number of processing cycles per packet for packets arriving at rates as high as one every
couple of system clock cycles.
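The points above can be sketched in a few lines of Python. This is an illustrative model only, not the
MTAP implementation: the class and field names, the weight and occupancy threshold parameters, and the use
of a binary heap as the orderlist are all assumptions. The virtual time is approximated by the finish number
of the last packet served, in the style of self-clocked fair queueing, which is also an assumption rather
than something the proposal specifies.

```python
import heapq
from dataclasses import dataclass

@dataclass
class FlowState:
    """Per 'input queue' state, as held in a queue state engine (names are hypothetical)."""
    weight: float            # relative share of line bandwidth
    max_occupancy: int       # drop threshold, in packets
    occupancy: int = 0
    last_finish: float = 0.0

class Orderlist:
    """Single output queue ordered by finish number (modelled here with a heap)."""
    def __init__(self):
        self._heap = []
        self._seq = 0            # tie-breaker keeps equal finish numbers in arrival order
        self.virtual_time = 0.0

    def insert(self, finish, flow_id, length):
        heapq.heappush(self._heap, (finish, self._seq, flow_id, length))
        self._seq += 1

    def pop(self):
        finish, _, flow_id, length = heapq.heappop(self._heap)
        self.virtual_time = finish
        return flow_id, length

def on_arrival(flows, orderlist, flow_id, length):
    """Think first: decide drop/position before the packet record is queued."""
    st = flows[flow_id]
    if st.occupancy >= st.max_occupancy:
        return False                                  # congestion management: drop
    start = max(st.last_finish, orderlist.virtual_time)
    st.last_finish = start + length / st.weight       # scheduling: finish number
    st.occupancy += 1
    orderlist.insert(st.last_finish, flow_id, length)  # queue later
    return True

def on_departure(flows, orderlist):
    flow_id, length = orderlist.pop()
    flows[flow_id].occupancy -= 1
    return flow_id, length

flows = {"voice": FlowState(weight=4.0, max_occupancy=2),
         "email": FlowState(weight=1.0, max_occupancy=8)}
ol = Orderlist()
arrivals = [("email", 1000), ("voice", 200), ("voice", 200), ("voice", 200), ("email", 1000)]
accepted = [on_arrival(flows, ol, f, n) for f, n in arrivals]
departures = [on_departure(flows, ol) for _ in range(sum(accepted))]
```

In this example the third voice packet is dropped on arrival because the voice flow has reached its assumed
occupancy limit, and the two accepted voice packets overtake the earlier email packet in the single output
queue. Departure is a plain pop with no scheduling decision left to make.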

Details of the embodiment 

Figure 1 shows the MTAP processing system in relation to other components in the wider Traffic handling system. 

The packet buffering system and orderlist management system are described in detail in sister patents as each is 
an innovative solution to a more specific problem. 

Figure 2 shows a functional decomposition of the MTAP processing system. 

This architecture is described in detail in Application Note 1 of the Per-Flow Traffic Handler design document [3].
This device is referred to as the Q-chip. 

Figure 3 shows a full traffic handler implementation using the Q-chip architecture. 

Q-Chip 2 is used to implement the orderlist management system. The Memory and Streaming hubs implement the 
Packet Buffering System. 

Additional related design work 

There are some additional points to note in our use of MTAP processors to perform Traffic Handling
functions. It is not clear whether they form claims in this invention, or whether they are patentable ideas
in their own right.

Class of service tables: CoS parameters are used in scheduling and congestion avoidance calculations. They are 
conventionally read by processors as a fixed group of values from a class of service table in a shared memory. 
This places further demands on system bus and memory access bandwidth. The table size also limits the number 
of different classes of service which may be stored. 

An intrinsic capability of the ClearSpeed MTAP processor is rapid, parallel local memory access. This can
be used to advantage as follows:

• The Class of Service table is mapped into each PE's memory. This means that passive state does not
require lookup from external memory. The enormous internal memory addressing bandwidth of the SIMD
processor is utilised.

• By performing multiple lookups into local memories in a massively parallel fashion, instead of single
large lookups from a shared external table, a huge number of different Class of Service combinations is
available from a relatively small volume of memory.

• PEs can perform proxy lookups on behalf of each other. A single CoS table can therefore be split across
two PEs, thus halving the memory requirement.

In the context of the ordinary use of MTAP processors this is not necessarily innovative. As a tool being
used to improve function and performance in a Traffic Handling system, it gives a real advantage over the
shared memory approach.



Reference material 



[1] A. Spencer "Traffic Management Whitepaper"

- Background information on Traffic Management 

[2] S. Keshav "An Engineering Approach to Computer Networking", Addison-Wesley, 1997 

- Scheduling, congestion avoidance and hierarchical link sharing theory 

[3] A. Spencer "Per-Flow Traffic Handler" 

- Original design work for ClearSpeed Traffic Handling solution 



2.6 Key features of the invention

• Conventional packet scheduling involves parallel enqueueing and then serialised scheduling from those
queues. For high performance traffic handling we have turned this around. Arriving packets are first
processed in parallel and subsequently enqueued in a serial orderlist. This is referred to as "Think First
Queue Later"™.

• The use of a pipelined parallel processing architecture (the ClearSpeed MTAP processor) is innovative in
a Traffic Handling application. It provides the wire speed processing capability which is essential for the
implementation of this concept.

• An alternate form of parallelism (to independent parallel schedulers) is thus exploited in order to solve
the processing issues in high speed Traffic Handling.



2.7 Scope of claim 

The claim applies most specifically to the use of MTAP processors in a Traffic Handling device used for network 
traffic management. The claim could be broadened beyond the specific use of MTAPs to cover the more general 
TF-QL concept and the implementation of the orderlist.

3. A packet storage system for traffic handling

3.1 Background 

A basic understanding of router anatomy and traffic handling is assumed. 

In an egress traffic handler, packets may arrive in bursts at a rate which exceeds the output line rate.
Temporary packet buffering is therefore required. By buffering packets in logical queues, different grades
of service can be achieved by varying the allocation of the available resources (line bandwidth and memory
capacity) amongst the queues.

3.2 The problem and prior art 

I am unaware of what prior art exists that might conflict with this proposal. What I do know is that packet
queueing and buffering at 40 Gbits/s has been referred to as being "impossible" by other players in the
Network Processing arena. Since ClearSpeed has a solution for this problem, it is reasonable to assume that
either a specific idea, or an original combination of ideas described in this chapter, must constitute an
original solution to a definable problem. The following issues in combination make packet buffering
particularly difficult at 40 Gbits/s line rates:

1. High data bandwidth is required to accommodate the simultaneous reading and writing of packets (at worst 
case fabric overspeed). 

2. High address bandwidth is required to cope with the worst case whereby streams of minimum sized pack- 
ets are simultaneously being written to and read from the memory in a random access mode. 

3. Memory capacity must be high as buffers will fill up rapidly during transient bursts at high line rates. 

4. The manipulation of state which is associated with either logical queue management or memory management
must be minimised at high line rates. The number of system clock cycles typically available to the hardware
or software device which performs such a function will be minimal.

A solution which places packets directly into queues mapped into statically assigned memory can meet (2)
and (4) but uses memory inefficiently and therefore fails on (3). A solution which buffers packets in
on-chip memory or SRAM will be able to meet (2) but not (3) since SRAM is a low capacity memory. A solution
which uses high capacity DRAM will be able to meet (3) but will have difficulty in meeting (2) as the
random access time is long. In attempting to meet (1), solutions need to implement high bandwidth
interconnects and high pincount interfaces.

In summary, it is very difficult to design architectures that meet all four criteria.

3.3 Summary of the invention

A Memory Hub, used for buffering packets in a high line rate Traffic Handler. 
3.4 List of attached figures

Figure 3.1 Functional overview of the components of the packet storage system

Figure 3.2 Architectural overview of the Memory Hub

Figure 3.3 Datagram Retrieval Unit design

Figure 3.4 Implementation of a packet storage system for traffic handling. (a) Multi-chip, scalable
implementation approach, (b) Single chip, highly integrated solution.

3.5 Detailed description

This proposal describes an invention in which component behaviours, ideas and devices are assembled into a
solution which meets all the required criteria for 40 Gbits/s packet buffering in a traffic handling
system. Although certain peripheral behaviours form part of the overall solution, the Memory Hub is the
primary embodiment of the invention.

Description of the concept and invention 

Memory system decoupling - First, isolate the problem so that it is not entangled and interdependent with other 
functions. 

• Relatively complex functions may be used to control the enqueueing and dequeueing of packets. The
complexity of such packet handling/processing can be alleviated somewhat if the packets themselves are not
passed around the system. Since access to packet content is not required in traffic handling, the packets
can be placed in memory and be represented by small, fixed size packet records. It is these records that
are placed in logical queues or data stores. The packet records, scheduled in order, can subsequently be
used to recover packets for forwarding on the output line.

• Queue management functions are thus decoupled from the task of packet buffering.

• Subsequent QoS processing is performed on a small, fixed sized record of packet metadata. This record
contains the packet location in memory (address of first block), a stream identifier (appended to the
packet upstream), the packet length and control flags. Additional data is looked up locally using the
stream identifier.

Dynamic memory management - Memory is generally statically assigned when it is used as storage for data
structures. Next define an efficient scheme for assigning memory to packets.

• Memory is fragmented into small blocks throughout the memory address space. The blocks may be one of n
different configured sizes. For reduced system complexity, n=2 is considered suitable. There is no static
assignment of memory to queues. Instead, packets are stored into one or more blocks of a given size (as
appropriate). Each block of a given packet points to the next in linked list fashion.

• Packets do not point to one another. In other words, there is no logical queue management in the memory
system.

• The availability of all memory blocks is recorded by a memory manager in the memory hub. To do this the
memory manager employs a bitmap - each bit representing a block.

• By scanning this bitmap, the memory manager can identify the addresses of free blocks in batches. Bits
are converted to addresses, and the addresses are held in a central pool of limited but adequate size. This
is more memory efficient than a memory free list (ie. storing the address of every free block).

• The central pool can be topped up by either scanning the bitmap, or more directly from the stream of
addresses returned as packets are read from memory and the memory blocks they occupy are released. If the
pool is full, returned addresses must be buffered and their information inserted into the bitmap.
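A minimal Python sketch of such a memory manager is given below, purely to illustrate the bitmap and central
pool interplay described above. The class and method names are hypothetical, one byte per block stands in
for one bit per block, and the policy of marking a block as in use at the moment its address enters the pool
is an assumption made to keep the scan simple.

```python
class MemoryManager:
    """Tracks block availability with a bitmap and a small central pool of free addresses."""

    def __init__(self, num_blocks, pool_size):
        self.num_blocks = num_blocks
        self.pool_size = pool_size
        self.in_use = bytearray(num_blocks)   # 1 flag per block; hardware would pack 8 per byte
        self.pool = []                        # central pool of free block addresses
        self._scan_pos = 0

    def refill(self):
        """Scan the bitmap from where the last scan stopped, topping up the pool."""
        scanned = 0
        while len(self.pool) < self.pool_size and scanned < self.num_blocks:
            addr = self._scan_pos
            if not self.in_use[addr]:
                self.in_use[addr] = 1         # address now lives in the pool
                self.pool.append(addr)
            self._scan_pos = (addr + 1) % self.num_blocks
            scanned += 1

    def take_batch(self, n):
        """Hand a batch of up to n free addresses to a client such as Arrivals."""
        if len(self.pool) < n:
            self.refill()
        batch, self.pool = self.pool[:n], self.pool[n:]
        return batch

    def release(self, addr):
        """Called as blocks are read out: reuse the address directly, or record it in the bitmap."""
        if len(self.pool) < self.pool_size:
            self.pool.append(addr)            # bit stays set; address is recycled immediately
        else:
            self.in_use[addr] = 0             # pool full: fall back to the bitmap

mm = MemoryManager(num_blocks=8, pool_size=4)
first = mm.take_batch(3)
second = mm.take_batch(2)
for a in first:
    mm.release(a)
```

The pool stays small because it only needs to cover the demand between refills, while the bitmap holds the
full availability state at one bit per block.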

Efficient packet storage - Placing information into memory will normally require the overhead of updating
state associated with the storage of that information. A means of smoothly streaming data into memory is
required which does not get held up by intermediate state manipulation.

• The device that receives packets from the switch fabric is referred to as the Arrivals block. This
pipelined processor extracts information from the packet which is used to create the packet record, and
slices the packet into chunks which can be mapped into memory blocks. An appropriate memory block size is
selected for each packet based on its length. The packet is forwarded to the Memory Hub and the packet
record to the system used for QoS processing and logical queueing.

• Arrivals maintains its own local pool of available memory blocks by periodically reading batches of
addresses from the central pool. A separate local (and central) pool is required for each different memory
block size.

• This enables Arrivals to load (fragmented) packets immediately into free blocks of memory. The only
overhead in doing this is to insert the address of the next memory block of the same packet into the
current block. This is a simple system which requires no state manipulation other than to pop items from
the local pool.

• Arrivals also supports load balancing across the available memory channels. When the local pool is
replenished, it receives the addresses of memory blocks which are mapped into different physical memory
channels in equal quantities. Consecutive blocks can then be written across the memory channels in round
robin fashion, efficiently spreading the address and data bandwidth.

• If the local pool becomes empty, packets may need to be partially or fully dropped by Arrivals. This is
reported through packet record creation so that event handling may be managed by the QoS processor. The
processor must purge the packet from memory, and must [...]. For both functions the QoS processor already
possesses [...] operation.
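The storage path above can be modelled with the following Python sketch. It is an assumption-laden
illustration rather than the Arrivals design: the block payload size, the dictionary used as memory, the
record layout and all function names are hypothetical, and only the fully dropped case is modelled when the
local pool runs short. A recovery helper is included solely to check the round trip.

```python
from collections import deque

BLOCK_PAYLOAD = 64   # hypothetical payload bytes per memory block

def interleave_channels(per_channel_free):
    """Build a local pool whose consecutive addresses rotate across memory channels."""
    pool = deque()
    for group in zip(*per_channel_free):   # one address from each channel per step
        pool.extend(group)
    return pool

def store_packet(packet, local_pool, memory):
    """Slice `packet` into blocks and write them as a linked list.

    `memory` maps block address -> (next_address_or_None, chunk).
    Returns a packet record, or None if the pool is too short (packet dropped).
    """
    chunks = [packet[i:i + BLOCK_PAYLOAD]
              for i in range(0, len(packet), BLOCK_PAYLOAD)] or [b""]
    if len(local_pool) < len(chunks):
        return None
    addrs = [local_pool.popleft() for _ in chunks]   # the only state touched: pop from pool
    for i, chunk in enumerate(chunks):
        nxt = addrs[i + 1] if i + 1 < len(addrs) else None
        memory[addrs[i]] = (nxt, chunk)
    return {"first_block": addrs[0], "length": len(packet)}

def recover_packet(record, memory):
    """Follow the next pointers from the first block to rebuild the packet."""
    data, addr = b"", record["first_block"]
    while addr is not None:
        addr, chunk = memory[addr]
        data += chunk
    return data[:record["length"]]

# Four hypothetical channels; block address modulo 4 selects the channel.
pool = interleave_channels([[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]])
memory = {}
pkt = bytes(range(150))
record = store_packet(pkt, pool, memory)
too_big = store_packet(bytes(1000), pool, memory)
```

Because the pool is interleaved, the three blocks of the example packet land on three different channels,
and the only per-packet output handed to QoS processing is the small record.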



Efficient packet recovery - The packet storage function is deliberately simplified as its 80G performance
requirement is challenging. Consequently, the 'open loop' method used by Arrivals makes the 40G packet
recovery function more complex. The main issue is that if a packet is stored in multiple blocks in memory,
each pointing to the next, one block must be read and the 'next pointer' extracted before a request for the
next block can be issued.

• The device that forwards packets to the line is the Dispatch block. It can be viewed as a type of DMA
engine which gets addresses (packet records) from the QoS processing system, uses the address to fetch
data (packet) from memory (the memory hub), and forwards the data to the line.

• Memory is nominally organised into (n=)2 sets of blocks - large and small. Block sizes are selected to (a)
optimise the usage of memory by matching peaks in the packet size distribution to the block sizes (fairly
obvious), and (b) make retrieval of packets stored as linked lists more efficient (less so). When a packet
is stored in multiple memory blocks there is a delay between the request being issued and the first data
(which holds the address of the next block) returning. If the block is large enough then the next pointer
can be extracted from the first few bytes of a block and the next request issued while the remainder of the
block is still being read from memory.

• By selecting a memory block size such that the majority of packets can be stored in a single memory
block, efficient aggregate packet recovery can be achieved. Packets can be fetched by a Datagram Retrieval
Unit (DRU) in the Memory Hub in a fully pipelined mode (ie. a packet request may be sent before the
response to a previous request has returned).

• The recovery of multi-block packets can be made efficient by implementing a hierarchical packet retrieval
system which has data recovery occurring at a number of different levels. Dispatch requests complete
packets from the Datagram Retrieval Unit (DRU). The DRU fetches memory blocks from the memory read
controllers of individual memory channels and reassembles packets. The memory read controllers fetch words
from memory in burst read accesses, reassembling the block content by identifying the start and end of
valid data within the block.

• For each memory block that the DRU reads, the block address is passed to the memory manager so that
the relevant memory map bits can be updated (or the block address reused).

• A Datagram Discard Unit operates in a similar mode to the DRU but does not return data. Its purpose is to
update the memory block state bits in the memory manager. This can be done directly for packets stored in
single blocks. Packets which are stored in multiple blocks must be read from memory so that the 'next
block' pointers can be recovered.

Implementing a viable system - The partitioning of a system must take account of real world issues such as
device pincounts, power dissipation, silicon process technology, PCB population and routing etc.

• Multiple, independently addressable memory channels are accessed via a Memory Hub. The Hub isolates 
the memory type and the fabrication technology required for that memory from the rest of the system. The 
Hub also enables a large, scalable number of memory channels to be implemented as (a) the hub package 
maximises the number of pins available for memory channel implementation, and (b) multiple hubs may be 
connected to a single QoS processing chip (see next point). 

• The Memory Hub connects to the processing chips (QoS and Queue chips) via narrow high speed serial 
chip-to-chip links. The burden of memory interfacing on the processing chip pincount is thus minimised, 
reducing the overall pincount and reducing the packaging cost of a potentially complex ASIC. 

Details of the embodiment 

Refer to the Per-Flow Traffic Handler design document [1] for further details on the microarchitectural
design of a Memory Hub.

Figure 1 illustrates how Arrivals and Dispatch provide the fork and join points in the packet stream. They
isolate the memory hub, which simply distributes packet chunks received from the high speed link to memory,
or retrieves packets from memory on request.

Figure 2 shows the content and interconnection of the Memory Hub in more detail. The bus acts as a crossbar
between the DRUs and the memory channel controllers. This means that packets can be stored in memory blocks
on multiple memory channels and fetched/reconstructed by a single DRU. Multiple DRUs exist to increase read
bandwidth and share the load for single line output (OC-768) systems. Alternatively, in multi-line systems
(eg. 4 x 10G Ethernet) a single DRU is assigned per line.

The memory manager contains the central pool and the memory bitmap. The memory bitmap would be implemented
in embedded SRAM and would typically be 250k bytes in size in order to support a 1 GByte packet memory
organised as 512 byte memory blocks.

Figure 3 shows the content and interconnection of the DRU block. The controller supervises read pipelining
or the process of packet recovery from a linked list.
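The quoted bitmap size can be checked with a short calculation. The sketch below reads the figures above as
a 1 GByte packet memory divided into 512 byte blocks with one bitmap bit per block, and takes 1 GByte to
mean 2**30 bytes, which is an assumption.

```python
PACKET_MEMORY_BYTES = 1 << 30   # 1 GByte, assumed to mean 2**30 bytes
BLOCK_BYTES = 512               # memory block size

num_blocks = PACKET_MEMORY_BYTES // BLOCK_BYTES   # one bitmap bit per block
bitmap_bytes = num_blocks // 8                    # eight block bits per byte

print(num_blocks, bitmap_bytes)
```

This gives 2,097,152 blocks and a 262,144 byte (256 KiB) bitmap, in line with the figure of roughly 250k
bytes.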

Figure 4 shows two possible system implementations. [...] channels of memory provide approximately
20 GBytes/s [...] random accesses per [...]. This meets [...] overspeed. It is conceivable that the whole
system could be implemented in a single device (b). The alternative is to distribute the function over
multiple devices (a). There are many good reasons for doing so, including:

• The physical distribution and second-level interconnect of a large number of memory devices on a PCB is
alleviated if all devices are not clustered around a single highly integrated processor.

• It is difficult to observe tight specifications on electrical characteristics, signal line separation and
line termination if memory channels are clustered densely around the perimeter of a single chip.

• [...] combination of high power and pincount in a single device [...].

The multi-chip approach is also versatile in that the number of hubs used can be scaled to meet the memory
requirement of the application (eg. scaling from 10G to 40G, and beyond).

Additional related design work 

[...] in detail the Arrivals and Dispatch blocks which could be instanced in a Streaming Hub. Whether these
blocks contain intellectual property which should be protected is unclear.

The Streaming Hub - is an implementation approach which could be used to localise noisy, power hungry
high-speed [...] a single device with minimal additional logic. Dispatch blocks must be co-located on a
single chip such as this with connections to all memory hubs so that a crossbar switch between multiple
hubs and multiple output lines can be provided.

The Arrivals block - implements pipelined processing techniques in order to perform rapid, on-the-fly
record ex[...] for every packet. Flags in the record are set to indicate whether a packet is fully
buffered, partially buffered, or dropped in Arrivals (dependent on the status of the packet buffer). This
enables the processor to perform appropriate housekeeping and error reporting.

The Dispatch block - is a DMA engine which reads a stream of records, retrieves the corresponding packets
from the Memory Hub and forwards them to the output line. Dispatch uses its output buffer occupancy as a
servo signal to the memory hub to control the rate at which the Hub delivers packet data.



Reference material 

[1] A. Spencer "Per-Flow Traffic Handler" 

- Original design work for ClearSpeed Traffic Handling solution 

3.6 Key features of the invention 



• Queue management and memory management are fully decoupled - (manifested by the packet/packet
record concept. This addresses criterion (4))

• There is direct streaming of packets into dynamically allocated memory with almost no first hand state
manipulation - (supported by the memory manager and the use of the local pool. This addresses criterion (4))

• Memory address space fragmentation and availability tracking using the memory manager's bitmap
provide the necessary support for dynamic memory management - (helps to meet (3) through efficient memory
usage)

• Usage of high speed serial chip-to-chip links to Memory Hub(s) - (remote fanout to multiple memory
channels enables address and data bandwidth to be scaled to meet requirements without meeting
implementation limits - addresses (1) and (2)).

3.7 Scope of the claim 

Specifically, a claim is made to a method for implementing a large number of independently addressable memory 
channels using high speed serial links to memory hubs, to be applied to Traffic Handling on network linecards. 
Traffic Handling is characterised by high line rates and high speed random access into a large memory. 

More broadly, the approach could be applied wherever (latency tolerant) high speed random access is required 
into a very large memory. 
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4. The state element 



4.1 Background 

Situations often arise whereby a function must be performed on a continuous stream of data. If the function is
implemented in software on a processor, then each datagram (packet of data) which arrives in sequence from the
stream must be stored, processed and then forwarded. This process will take some finite quantity of time to
execute. As the rate of packet arrival increases there will come a point at which a single processor can no longer
keep up. The function must then either be distributed across multiple processors arranged in a pipeline, or across
multiple processors arranged in parallel - each receiving a packet from the stream in turn in some round robin
sequence. Packets output from parallel processors are typically reordered before forwarding.

This is fine as long as there is no interdependence between the processors. They operate independently of one
another, perhaps sharing a common code or data store to which all have read-only access.

4.2 The problem and prior art 

A problem arises when such processors share state variables for which both read and write access is required.
Two processors cannot be permitted to simultaneously read/modify/writeback a shared variable, as the result from
the first will be overwritten by the second. It is necessary to serialise the accesses. This raises two significant
issues:

1. A system for interlocking processors together must be implemented so that they may arbitrate for a
resource when there is contention. This control signalling can be complex and add significant functional and
performance overhead.

2. Once a processor has successfully negotiated for a resource, it should use that resource and then release
it as soon as possible to limit the delay imposed on other processors. If access latencies to external
memories are long, this can impact heavily on system performance.

[illegible] processors, or control logic and caches can be used to intercept concurrent accesses [illegible];
however, these can be complex, slow and/or require significant support tied into
hardware. Embedded memory can reduce lock-out time, but delays can still be significant.
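
The overwrite hazard described above can be illustrated with a small, deterministic sketch. It is not taken from the original design: the names `SharedMemory`, `interleaved_rmw` and `serialised_rmw` are hypothetical, and the interleaving is forced by hand rather than produced by real parallel hardware.

```python
class SharedMemory:
    """A plain shared store with separate read and write operations."""

    def __init__(self):
        self.cells = {}

    def read(self, addr):
        return self.cells.get(addr, 0)

    def write(self, addr, value):
        self.cells[addr] = value


def interleaved_rmw(mem, addr, increments):
    """Every processor reads before any of them writes back (the hazard)."""
    snapshots = [mem.read(addr) for _ in increments]  # all reads see the old value
    for old, inc in zip(snapshots, increments):
        mem.write(addr, old + inc)                    # later writes overwrite earlier ones
    return mem.read(addr)


def serialised_rmw(mem, addr, increments):
    """Each read/modify/writeback completes before the next one begins."""
    for inc in increments:
        mem.write(addr, mem.read(addr) + inc)
    return mem.read(addr)
```

With a variable initialised to 10 and two processors each adding 1, the interleaved case ends at 11 (one update is lost) whereas the serialised case ends at 12.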

4.3 Summary of the invention 

A smart memory cell for serialising accesses to shared state variables. 




Confidentiality Level: RED © Copyright ClearSpeed Technology Ltd 2001







4.4 List of attached figures 



Figure 4.1 Illustration of the advantage of the state element concept: (a) conventional approach, (b) using state elements
Figure 4.2 Functional overview of the state element
Figure 4.3 Implementation overview of the state element
Figure 4.4 Implementation examples of the state and command units



4.5 Detailed description 

State elements are the key components which perform the serialisation of accesses into a shared memory. This
patent describes only the state element. In a real system state elements must be combined in state engines and
connected to the bus. The innovative arrangement of state elements into larger state engines (which can be
connected to a system bus) is covered by a sister patent - "The State Engine".

Description of the concept and invention 

Instead of getting parallel processors to read from memory, modify and write back data, get them to request that
the memory performs the modification on their behalf. In other words, position the serialisation point not
within/between each processor, but in a simple shared processor which has local and rapid access to the memory
in which the shared state variables are stored.

The state engine concept and the advantages it brings are illustrated in figure 1.



The state element is analogous to an object in OOD. It has privately stored data which is accessible only via the
[illegible] the object. By issuing commands, parallel processors could be considered to be making function calls to
the object.

A state element comprises a block of embedded memory with single cycle read/write access time combined
with an Arithmetic/Logic Unit. The Arithmetic Unit receives commands (from processors) which comprise
an address, data and a command code. The address identifies the state variable which is to be accessed, the data
[illegible] the simple compute unit uses to modify the variable, and the command selects a locally stored
[illegible] microcode which is able to read, modify and writeback the state variable within a very small
number of system clock cycles. The result can be returned to the processor which issued each command.

These components are shown in figure 2. 
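
As a minimal software analogy of the arrangement just described (a sketch under stated assumptions, not the hardware design), a state element can be modelled as private memory plus a table of small functions selected by command code. The class name `StateElement`, the command names in `MICROCODE` and the function signature are all hypothetical.

```python
class StateElement:
    """Private memory plus a small table of functions selected by command code."""

    def __init__(self, size, microcode):
        self._memory = [0] * size          # private: reachable only through commands
        self._microcode = dict(microcode)  # command code -> f(old, data) -> (new, result)

    def execute(self, command_code, address, data=0):
        """Read, modify and write back one state variable as a single step."""
        function = self._microcode[command_code]
        old = self._memory[address]
        new, result = function(old, data)
        self._memory[address] = new
        return result                      # returned to the issuing processor


# Hypothetical command set, chosen only for illustration.
MICROCODE = {
    "READ": lambda old, data: (old, old),
    "WRITE": lambda old, data: (data, data),
    "ADD": lambda old, data: (old + data, old + data),
    # Conditional access: store data only if the variable is still zero.
    "TEST_AND_SET": lambda old, data: (data if old == 0 else old, old),
}
```

Because every command passes through `execute`, two processors issuing `ADD` on the same address cannot interleave their reads and writes; serialisation is a property of the element rather than of the callers.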
Details of the embodiment 

The "State Engines - a design framework" document [1] gives full details of the microarchitectural design and
implementation.

A state element consists of an embedded memory and an attached function - the function could either
be hardwired (a finite state machine) or a programmable, microcoded circuit. The latter approach is the more
versatile and complex, and receives further attention in this document.

A picture of the system of component modules and their interconnection is shown in figure 3. Note
the [illegible] condition blocks. These greatly extend the functional capability of the element
(as described in [illegible]).

The emphasis in the element design is on rapid memory access speed, not processing capability.
Embedded [illegible] for which single cycle access time is achievable. [illegible] is possible
[illegible] period, as it is possible to perform a simple arithmetic operation on the result of a read and
[illegible] write it back within the second cycle. Typically, a command could be fully processed within
3-5 clock cycles. Figure 4 illustrates the simplicity of the arithmetic unit, and how the path between the
[illegible] has minimal delay. The lower diagram shows a more complex variant in which
multiple items of [illegible] memory. The impact on the command line turnaround (and microcode store size)
is significant. (However, this is not to say that the lower circuit should not be used. In a lower performance
system with a more complex set of state it could be the preferred approach.)

Additional related design work 

Reference [1] also covers some algorithmic techniques used in conjunction with state elements. 
System threads: Background system threads could be programmed to operate on the data in the state memory
when processors are not being serviced. For instance, this could be useful for identifying state entries
which are idle.

'Find free queue' system function: This is a background thread which implements a [illegible]
algorithm for de-assigning state entries used to represent/manage queues which go idle (i.e. empty).

[illegible]: The [illegible] and 'address' units are special function units designed to support the find free
[illegible]. The features they provide are considered to be of generic value and could be used by other
algorithms (such as that required for maintaining meters in state elements).

[illegible]: The information required by the Self-Clocked Fair Queueing algorithm cannot be mapped
[illegible] the state element. It is represented in a form which makes access and manipulation more robust and
efficient. Is this a claim relevant to the state element itself or the software using the state element? (see ref [2])



Reference material

[1] A. Spencer "State Engines - a design framework" 

- Original design work for ClearSpeed State Engine solution 
[2] A. Spencer "Per-Flow Traffic Handler" 

- Original design work for ClearSpeed Traffic Handling solution 

4.6 Key features of the invention 

• Intelligent memory - The state element localises the serialisation of parallel data accesses at the memory 
end, not the processor end. This greatly reduces the latency commonly associated with the blocking of 
state. 

• Functional versatility - The state element provides a number of (configured/)programmed remote functions
which may be performed on the stored data - functions would comprise a small number of operations, including
data read, write, arithmetic operations and conditional accesses.

• Flexibility - The functions can (but need not) be expressed in microcode so that the state element remains
programmable and does not 'tie' software executing on the processor to functions hardwired into the state
element.

• System efficiency - The read/writeback occurs between the ALU and the memory inside the state element. 
Only the command travels across the system bus. This reduces the burden on the system bus as com- 
pared with conventional approaches. 

• System simplicity - The read/modify/write is encapsulated within the state element and serialisation is 
inherently enforced by the state element logic. Processors can simultaneously issue commands which will 
cause a function to act on the same item of state without having to first negotiate with one-another. 



4.7 Scope of the claim 

The problem was identified while using MTAP processors to access shared state in a Traffic Handling application, 
and state elements were conceived to resolve the contention issue. 

It is recognised that contention is not an issue exclusive to Traffic Handling, therefore state elements could be used 
as a general purpose tool in support of MTAP processors in any application. 

Most broadly, contention can arise when any two processors in a realtime (data flow processing) system require 
R/M/W access to a shared state variable. State Elements could therefore be used in conjunction with any parallel 
or pipelined arrangement of processors. 
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5. The state engine 



5.1 Background 

Situations often arise whereby functions must be performed on a continuous stream of data. If the functions are
implemented in software on a processor, then each datagram (packet of data) which arrives in sequence from the
stream must be stored, processed and then forwarded. This process will take some finite quantity of time to
execute. As the rate of packet arrival increases there will come a point at which a single processor can no longer
keep up. The function must then either be distributed across multiple processors arranged in a pipeline, or across
multiple processors arranged in parallel - each receiving a packet from the stream in turn in some round robin
sequence. Packets output from parallel processors are typically reordered before forwarding.

This is a well proven approach to high performance packet processing, but is limited in its scalability as the number
of processors increases. Access to shared memories, be it for code or data, eventually becomes a bottleneck.
Simultaneous access to shared state will further add to the complexity of system control signalling in order to
[illegible] contention.

MTAP processors resolve traditional issues relating to instruction lookup, and State Element technology supports
[illegible] processing [illegible] by localising and managing serialisation to shared state. (Both technologies are
proprietary to ClearSpeed Technology Ltd.) This leaves the issue of high speed access to multiple items of shared
state information by multiple parallel processors. As the number of processors and the complexity of their
algorithms increase, the address and data bandwidth requirement over the system bus to the shared data will also
increase. This can then become a bottleneck.

A good case in point is the challenge of Traffic Management in network routers. A significant, recognised issue in
per-flow Traffic Handling is that a number of items of state need to be maintained for each of a large number of
queues. The implications of this are that (a) a considerable volume of shared memory needs to be implemented,
(b) a lot of memory address bandwidth is required if each queue requires that separate accesses be made to
different [illegible] state variables, and (c) the memory access latency is likely to be long, thus causing state
blocking during modification to impact on performance.



5.2 The problem and prior art 

Contention for shared state variables can be resolved by implementing state elements as described in reference
[1] and in the 'State Element' patent proposal. However, the successful implementation of the state element
concept in high performance systems requires additional innovation to overcome the following:

1. Arranging processors in parallel can create a high rate of access to the same item of state.

2. What if a given function needs access to multiple variables from the same address? In other words,
what if it needs to access and process a state record?

3. What if multiple functions executing in a processor on a single datagram each require access to different
independently addressable tables of state variables or records?

The fundamental problem being addressed is that of a high rate of state access. This problem must be
solved in a flexible way which enables the easy scaling of both the quantity of state being stored and the rate of
state access.



5.3 Summary of the invention 

A formal framework for designing an active state storage system using state elements 
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5.4 List of attached figures 
Figure 5.1 Conceptual design hierarchy of the state engine



Figure 5.2 Functional view of a state engine
Figure 5.3 Implementation of the state cell
Figure 5.4 The Queue State Engine - an implementation example of a complex state engine design for Traffic
Handling and queue management



5.5 Detailed description 



Description of the concept and invention 

A state engine can be built up in a structured and well defined manner using the state element as an atomic part. 
Just as atoms are the components of molecules, which may be the building blocks of simple cells, which then com- 
bine into simple organisms - state elements are combined into state cells, which are multiplied into state arrays, 
which may be grouped together to form state engines. 

This hierarchical design framework is illustrated in figure 1. The component parts shown are: 

State Record - This is a conceptual entity only. It is a group of one or more state variables which share a common
address.

Command line - A message sent by a processor to the state engine. Fields in the command line include command 
code, address and data. The processor is effectively requesting that the function indexed by the command code 
be performed on the state record at the given address. Parameters can be both supplied and returned in the gen- 
eral purpose data field. 

State Element - As described in reference [1] and the sister "State Element" patent, a state element is a small,
private memory which contains state variables accessible only via functions executed by the state element's
control logic. Functions typically read a state variable, perform some modification and write a new value back. The
[illegible] point by executing a simple function on memory at maximum speed.

State Cell - [illegible] record to be [illegible] the number of instances of a state record which may
be stored in a single State Cell [illegible].

State Array - [illegible] is to provide scalable [illegible]. It provides a means for scaling address [illegible].

State Engine - [illegible] construct a block which can be configured and accessed via a system bus. Components
include:

• Bus interface logic

• System control logic - The state engine controller may issue (private) system commands to the state
[illegible]

State Engine behaviours include: [illegible]
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Details of the embodiment 

Example state engine architectures are documented in the "State Engine design framework" document [1].

The design of the state element is detailed in the "State Element" patent.

The design of the State Cell which stitches Elements together in a pipeline is shown in figure 3.

Additional related design work 

System threads: -

Reference material

[1] A. Spencer "State Engines - a design framework" 

- Original design work for ClearSpeed State Engine solution 
[2] A. Spencer "Per-Flow Traffic Handler" 

- Original design work for ClearSpeed Traffic Handling solution 



5.6 Key features of the invention 

All of the specified issues associated with high speed data lookup by parallel processors are addressed: 
• A formal framework for creating a parallel coprocessor using smart memory (state elements) 

• [illegible] - Lookups into different tables are not fired off from a point source into different
[illegible]. [illegible] (command line) is routed from table to table in a serial [illegible].

• [illegible] along the pipe from table to table they are used and sometimes [illegible].

• [illegible] access - The data inserted into the command by one table [illegible].

• State cell concept - high throughput pipelined processing (scalable processing power)

• State array concept - 'layout friendly' scheme for scaling quantity of state, bandwidth and load balancing

• State engine concept - multiple orthogonal lookups from a single command; uses switching technology for
multi-lane state engine architectures. Controller provides system commands for data and instruction
broadcasts.



5.7 Scope of the claim 

The problems were identified while using MTAP processors to access shared state in a Traffic Handling
application. State engines were conceived as a way to arrange the state elements (required for managing state
contention) in a way that addressed the additional issue of a high rate of state access.

State Engines can also be architected from the same or similar state elements to meet the needs of other appli- 
cations - for instance meter management in the related area of Traffic Conditioning. It is therefore speculated that 
state engines could be used to deliver state element technology to any other application in which parallel (or even 
pipelined) processors are accessing shared state at high rates. 
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6. Programmable orderlist manager 



6.1 Background 

A basic understanding of router anatomy and traffic handling is assumed.

In traffic handling, packets may be placed in one of a number of queues. With more than one queue present a
scheduling function must determine the order in which packets are served from the queues. The scheduled order
is determined principally by the relative priorities that the scheduler places on the queues - not on the order in
which packets arrived at the queues. The scheduling function is thus fairly serial in character.

For example, consider two popular scheduling methods:

1. Fair Queue scheduling - every packet in the queue is given a finish number which indicates the relative
point in time that the packet is entitled to be output. The function that serves packets from the queue must 
identify the queue whose next packet has the smallest finish number. Ideally, only after the packet has 
been served and the next packet in the same queue been revealed can the dequeueing function make its 
next decision. 

2. Round Robin scheduling - Queues are inspected in turn in a predetermined sequence. On each visit a pre- 
scribed quota of data may be served. 
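
The serial character of both methods can be seen in a short sketch. This is an illustration only: the function names are hypothetical, finish numbers are assumed to be pre-computed, and the round robin quota is counted in packets rather than in units of data for simplicity.

```python
import heapq
from collections import deque


def fair_queue_order(queues):
    """Serve packets by smallest head-of-queue finish number.

    `queues` maps a queue name to a list of (finish_number, packet) pairs
    in arrival order.
    """
    pending = {name: deque(items) for name, items in queues.items()}
    heads = [(q[0][0], name) for name, q in pending.items() if q]
    heapq.heapify(heads)
    served = []
    while heads:
        _, name = heapq.heappop(heads)
        _, packet = pending[name].popleft()
        served.append(packet)
        if pending[name]:  # the next packet in that queue is only now revealed
            heapq.heappush(heads, (pending[name][0][0], name))
    return served


def round_robin_order(queues, quota):
    """Visit queues in a fixed sequence, serving up to `quota` packets per visit."""
    pending = {name: deque(items) for name, items in queues.items()}
    served = []
    while any(pending.values()):
        for name in pending:
            for _ in range(min(quota, len(pending[name]))):
                served.append(pending[name].popleft()[1])
    return served
```

In `fair_queue_order` each decision depends on the outcome of the previous one, which is why the function cannot simply be spread across independent processors.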



6.2 The problem and prior art 

The fundamental problem is how to perform such scheduling algorithms at high speeds. A serialised process can
only scale with clock/cycle frequency, or by increasing the depth of the processing pipe which makes the
scheduling decision. This approach to scaling runs out of steam at 40 Gbits/s line rates, when the available silicon
processing technology may only be able to provide a couple of system clock cycles per packet.

On top of this, the scheduling and queue management task is further confounded by a requirement for a large 
number of potentially very deep queues. Hardware which executes the scheduling function in a serial manner is 
then likely to be highly customised and therefore inflexible if it is to meet the required performance. 

6.3 Summary of the invention 

A system for maintaining ordered logical data structures in software at high speeds 
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6.4 List of attached figures 
Figure 6.1 Concept of orderlist management using a bin sort approach

Figure 6.2 Functional overview of an orderlist management system

Figure 6.3 Implementation of orderlist manager using MTAP processors and state engines
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6.5 Detailed description 

Introduction

[illegible] can be exploited in the implementation of the solution. When perform[illegible] force in a serialised
solution, the way forward is to find [illegible]. This proposal describes an innovative parallel processing
architecture [illegible] processing cycles per packet to enable the finish number calculation to
be made at [illegible].

Description of the concept and invention 

Figure 1 shows the basic concept of bin sorting: 

• [illegible] is used to contain packets with a certain [illegible] numbers.

• A function is required which receives packets and places them in the appropriate bin.

• A function is required which reads the content of each bin in turn, in ascending order of the finish
[illegible].

• [illegible]

• [illegible] are sorted into bins. A [illegible]

• The final stage bins can relatively easily be sorted into actual order for output.

Figure 2 shows an approach that applies this concept. The numbering shows the sequence of events as packets
arrive, state is accessed, and packets are binned etc. (Full walkthrough could be provided if necessary.)

• [illegible] managed by pointers which locate the insertion and removal points for data to/from the structure.

• [illegible] comprises more than one set of bins. Within a set of bins the finish [illegible]
range [illegible] of finish [illegible] across all bins [illegible] one set [illegible] to the [illegible].

• When a bin is emptied, it is sorted into the next set of bins.

• [illegible] when the smallest [illegible].
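
The iterative binning idea can be sketched in a few lines. This is an illustrative two-stage version under stated assumptions, not the proposal's implementation: the function name, the choice of exactly two sets of bins, non-negative integer finish numbers, and `coarse_width` being a multiple of `fine_width` are all assumptions made here.

```python
def iterative_bin_sort(packets, coarse_width, fine_width):
    """Order (finish_number, packet) pairs using two sets of bins."""
    coarse = {}
    for finish, packet in packets:  # first pass: coarse binning on arrival
        coarse.setdefault(finish // coarse_width, []).append((finish, packet))

    output = []
    for index in sorted(coarse):    # read coarse bins in ascending order
        base = index * coarse_width
        fine = [[] for _ in range(coarse_width // fine_width)]
        for finish, packet in coarse[index]:  # an emptied bin is re-binned
            fine[(finish - base) // fine_width].append((finish, packet))
        for contents in fine:       # final stage: small sort within each bin
            contents.sort(key=lambda item: item[0])
            output.extend(packet for _, packet in contents)
    return output
```

Only the final, small bins are ever fully sorted; the earlier stages need nothing more than an index calculation per packet, which is the kind of work that can be shared between parallel processors.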
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Details of the embodiment 

Figure 3 shows how MTAP processors can be used to implement an orderlist manager. The numbering shows the
sequence of events which occur as packets are scheduled, binned, re-binned, sorted and output etc. (Full
walkthrough could be provided if necessary.)

• When MTAP processors are arranged in a data flow processing architecture they are well suited to the
processing of a high speed stream of packets. They naturally operate by performing batch reads of data
[1], doing some processing, and then pushing data out onto queues.

• State Engines used as hardware accelerators can enable the MTAP processors to store and manage the
logical state required for the bins.

• The bins are most conveniently implemented as LIFO stacks. This minimises the required state per bin,
and simplifies the management of bins as linked lists in memory.

• When each packet is stored in a bin its location in memory is retained in the state engine. This can be used
as a pointer by the next packet which needs to be written to the same bin. Each bin is thus a stack in which
each entry points to the next one down.

• A databuffer block is used to store the bins. The block contains a bin memory and presents producer and
consumer interfaces to the processor [1]. The consumer receives a stream of packets and simply writes
them to a supplied address. The producer receives batch read requests from the processor and outputs
data from the requested bin.

• As each bin is organised as a linked list, it is the responsibility of the producer to extract the linked list
pointer from each packet as it is read from the bin. Using SRAM the access time should be fast enough to
make this serialised process efficient.

• In a real system embodiment it is not necessary to store the actual packets in the bins. Small records which
represent packets can be processed in their place. This is described in [2]. This simplifies implementation
as the bins now store small entities (records) of fixed size.
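
The LIFO linked-list arrangement of the bins can be sketched as follows. The class and method names are hypothetical, the `top` list stands in for the per-bin state that a state engine would hold, and write addresses are supplied by the caller, as in the consumer interface described above.

```python
NULL = -1  # hypothetical "end of list" marker


class BinStore:
    """Bins held as LIFO linked lists inside one shared bin memory."""

    def __init__(self, num_bins, memory_size):
        self.memory = [None] * memory_size  # each slot: (record, pointer to next one down)
        self.top = [NULL] * num_bins        # per-bin state: address of the newest entry

    def push(self, bin_id, address, record):
        """Consumer side: write a record to a supplied address and link it in."""
        self.memory[address] = (record, self.top[bin_id])
        self.top[bin_id] = address

    def drain(self, bin_id):
        """Producer side: read a whole bin, following the pointer in each entry."""
        records = []
        address = self.top[bin_id]
        while address != NULL:
            record, address = self.memory[address]
            records.append(record)
        self.top[bin_id] = NULL
        return records
```

A single pointer per bin is the only state needed, which is the saving that the LIFO choice is intended to provide; the cost is that a drained bin comes out in reverse order of insertion.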



Additional related design work 

On-demand load balancing: The MTAP processors are split between the enqueueing (scheduling) task and the 
dequeueing (final sort) task. A sufficient number of processors must be implemented in order that they can cope 
with the transient worst case rate of packet arrival. However, the nominal arrival rate is much lower. This would 
mean that a number of processors could routinely lie idle or be underused. The proposal is that a small number
of processors are assigned permanently to either the enqueueing or dequeueing tasks. The remainder may float. 
If input congestion is detected then the floating processors thread switch and assist in the enqueueing task. When 
the congestion is cleared, the floating processors migrate to the dequeueing task and help to clear the backlog in 
the queues. If dequeuing is well resourced, then floating processors may default to peripheral tasks such as sta- 
tistics pre-processing for subsequent reporting to the control plane. 
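
The floating-processor policy amounts to a simple priority rule, sketched below. The function name, the task labels and the two boolean inputs are hypothetical; how congestion and backlog would actually be detected is not specified in the text.

```python
def floating_task(input_congested, dequeue_backlog):
    """Choose the task for a floating processor from two status flags."""
    if input_congested:
        return "enqueue"     # assist with the transient burst of arrivals
    if dequeue_backlog:
        return "dequeue"     # help clear the backlog left in the queues
    return "peripheral"      # e.g. statistics pre-processing
```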

Shadowed memory management: This is an essential element of the orderlist management system solution.
Although simple, I felt it might be sufficiently valuable as an idea to describe it separately. Any given data structure
needs functions to read and write entries, logical state to characterise the structure, and underlying memory
management to efficiently store the structure. The MTAP processor and accelerator only achieve the first two of
these. No mention has yet been made of maintaining a freelist of available memory and allocating memory for the
data structure to grow into. This in itself can often incur considerable overhead. The efficiency of the orderlist
manager is only possible because the memory management has already been performed for it as follows:

Background: 

• In 40G traffic handling it is practical to divorce the packet buffering from the processing task. As described
in the 40G programmable Traffic Handler proposal, packets are stored in memory within the packet
buffering system. Small records are passed to the processing system, which efficiently manipulates records in
place of the packets they represent.

• The packet memory is partitioned into small blocks of fixed size. A free list or bitmap is maintained which
keeps track of which blocks are allocated and which are free.

• The bitmap is used by the memory management system to dynamically manage memory. Packets can be
streamed directly into memory on arrival from the switch fabric, with the small record of metadata retained
for processing.

• Most significantly, the record will contain the memory address of the (first) block of memory in which the
packet is stored.

Packet record handling and storage 

• Packet records are stored in a data structure by the orderlist manager. This will require the existence of
two resources - storage for the logical state describing each bin in the structure, and storage for the
records themselves.

• State storage and bin manipulation are implemented by the QSE and MTAPs respectively. Bin memory
management relies directly on the packet memory management.

Bin memory management concept:

• The memory provided for record storage is organised so that it mirrors the memory provided for packets.
For each memory block in the packet memory there is a corresponding location in the record store at a
directly related memory address.

• When a packet is stored then the memory block into which it is placed must be free. The location in the
record memory must also then be free.

• When the record is scheduled, the memory system recovers the packet and the memory block occupied by
that packet is released. Simultaneously, the corresponding record location is released. Because the
storage and retrieval of the packet record is effectively "nested" within the time over which the packet is stored
and recovered, the system is very robust.
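
The shadowing idea can be sketched with a single bitmap that governs two mirrored stores. This is a simplified illustration: the class and method names are hypothetical, one packet is assumed to occupy exactly one block, and the "directly related" address is taken to be the same index in both stores.

```python
class ShadowedMemory:
    """One free-block bitmap serving both the packet memory and the record store."""

    def __init__(self, num_blocks):
        self.in_use = [False] * num_blocks        # the memory manager's bitmap
        self.packet_blocks = [None] * num_blocks  # packet memory (one packet per block here)
        self.record_slots = [None] * num_blocks   # record store mirroring the packet memory

    def store_packet(self, packet):
        """Allocate a free block, store the packet, and return its address."""
        address = self.in_use.index(False)        # raises ValueError when memory is full
        self.in_use[address] = True
        self.packet_blocks[address] = packet
        return address

    def store_record(self, address, record):
        """The slot at the packet's address is free by construction; nothing to allocate."""
        self.record_slots[address] = record

    def release(self, address):
        """Recover the packet and free the block and its record slot together."""
        packet = self.packet_blocks[address]
        self.packet_blocks[address] = None
        self.record_slots[address] = None
        self.in_use[address] = False
        return packet
```

Note that `store_record` never consults the bitmap: the record store inherits its allocation decisions from the packet memory, which is the overhead the orderlist manager avoids.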

Use of pointers: 

• As records are randomly stored within the record memory, the records belonging to a given bin must point
to one another in a linked list arrangement.

• The record contains the pointer to the packet memory. The pointer also then points to the record's own
location in its memory.

• In effect, a record both points to itself and also to its neighbour. The same information is being stored twice.
Consider records A and B which are adjacent in a linked list. Record A has pointer 'Self_A' which is its own
location, and 'Next_A' which is a pointer to the next record in the list (record B). Record B also has 'Self_B'
and 'Next_B'. It can be seen that 'Next_A' is the same as 'Self_B'.

• Only the 'Next' pointer is actually required in each packet record. When a bin is read (in order A, B, C....) each
record can have its own pointer identity restored by retrieving it from the record before it in the list. This
provides a considerable reduction in the record storage requirement.
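
The restoration of each record's own pointer while a bin is read can be sketched as below. The function name and the dictionary representation of the record memory are hypothetical; the head address is assumed to come from the bin's logical state.

```python
def read_bin(memory, head):
    """Walk a bin, rebuilding each record's own pointer from its predecessor.

    `memory` maps address -> (payload, next_address); `head` is the address of
    the first record. None marks the end of the list.
    """
    restored = []
    self_pointer = head
    while self_pointer is not None:
        payload, next_pointer = memory[self_pointer]
        restored.append((self_pointer, payload))  # 'Self' is recovered, never stored
        self_pointer = next_pointer               # 'Next_A' is 'Self_B'
    return restored
```

For a bin whose records A, B and C sit at addresses 12, 40 and 7, only the 'Next' values 40, 7 and the end marker are stored; the walk returns each payload paired with the packet memory address it stands for.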

Key points: 

• Two storage systems can share the same memory manager when the write/read accesses to one are 
nested within the write/read accesses to the other. 

• Implied data - A record's own pointer identity when it is not stored is interchangeable with a linked list 
pointer when the record is stored. A translation must occur when the record is passed in and out of storage. 

Software algorithms for bin management - Novel algorithms for managing the bins have been/are being devel- 
oped by Ken Cameron. This work might either form part of this proposal, or a proposal in its own right. Refer to 
reference [3] for further details. 

Reference material 

[1] ClearSpeed "Network Processor architecture document - V1.0" 

- Architecture proposal for ClearSpeed 40G Network Processor 

[2] A. Spencer "Traffic Handler architecture document - V0.1" 

- Architecture proposal for ClearSpeed 40G Traffic Handler (in progress) 

[3] K. Cameron "Software Queueing for Traffic Management" 

- Discussion document 

6.6 Key features of the invention 

• A single orderlist is used instead of multiple queues. 

• A method of iterative binning is described which makes the management of large orderlists very efficient. 

• Queue management is performed entirely in software using MTAP processors and accompanying state engines. 

• The processing and state resource can be partitioned to provide either single or multiple orderlists. 



6.7 Scope of the claim 

The claim is considered to be original on two fronts: 

Firstly, it is an alternate way of queueing data. Packets (or packet records) are queued directly into an orderlist 
instead of in separate per-stream queues. 

Secondly, it is unlikely that MTAP processors have been used previously to perform the queue management - ei- 
ther in terms of the process of making the binning decisions, or in the use of state engines for bin pointer manage- 
ment. 

Because the invention relies on binning and data structure management in software, it is also speculated that 
alternative data structures could be mapped into the hardware resources and managed by different software 
processes. This implies that the invention could have broader application beyond that of supporting Traffic Handling. 

7. Overlapped virtual queueing 

7.1 Background 

Some prior understanding of Traffic Management, Traffic Handling and (virtual) queueing would be a benefit. 

Per-flow traffic handling requires the independent queueing of traffic belonging to hundreds of thousands, if not 
millions, of different connections. Virtual queueing is a textbook approach in which a limited number of physical 
queues are shared dynamically between a much larger number of connections. In virtual queueing a queue is 
assigned from a common pool to a stream (flow or flow aggregate) when that flow becomes active. Conversely, if 
an assigned queue remains empty for a given duration, it is effectively inactive - an unused resource which should 
be returned to the pool. Observing that a finite memory volume limits the number of packets which may be 
buffered at any given time, it then follows that a finite pool of shared physical queues can be used to support a much 
larger number of end-to-end connections, as the connections cannot all simultaneously have traffic backlogged in 
the traffic handling system. 

So, the idea is that if a connection does not actually have traffic backlogged in a queue at a given point in time, 
then its queue is unused and might just as well not exist. Queues are allocated and deallocated on a per-demand 
basis. A queue is only assigned to a connection when that connection has backlogged traffic. 
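
As a point of reference, conventional virtual queueing can be sketched as a pool of physical queues handed out on demand. The class `VirtualQueuePool` and its fields are hypothetical, and the sketch simplifies the text in one respect: a queue is returned to the pool as soon as it empties, rather than after remaining empty for a given duration.

```python
class VirtualQueuePool:
    def __init__(self, num_physical_queues):
        self.free = list(range(num_physical_queues))  # common pool of queue ids
        self.assigned = {}   # stream id -> physical queue id
        self.backlog = {}    # physical queue id -> list of backlogged packets

    def enqueue(self, stream, packet):
        if stream not in self.assigned:
            # The stream has just become active: assign a queue from the pool.
            if not self.free:
                raise RuntimeError("no free physical queue")
            q = self.free.pop()
            self.assigned[stream] = q
            self.backlog[q] = []
        self.backlog[self.assigned[stream]].append(packet)

    def dequeue(self, stream):
        q = self.assigned[stream]
        packet = self.backlog[q].pop(0)
        if not self.backlog[q]:
            # Simplification: release the queue immediately once it is empty.
            del self.assigned[stream]
            del self.backlog[q]
            self.free.append(q)
        return packet
```

With a single physical queue, two streams can share it provided they are never backlogged at the same time.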



7.2 The problem and prior art 

The implementation of virtual queueing presents its own problems: 

1. How are queues deassigned? This could be tricky if there are a number of points in the handler at which 
packets belonging to a given connection could be buffered. 

2. How are queues assigned to new connections? Packets belonging to new connections could appear and 
make an on the spot request for a queue. 

3. The purge that is necessary in-between a queue being deassigned and it being assigned to a new connec- 
tion requires significant system wide messaging and state synchronisation. 

This last issue is the core focus of the ClearSpeed "Overlapped Virtual Queueing" concept. 

Consider the simple high level view of traffic handling behaviour illustrated in figure 7.1. 

Stream labels attached to packets arriving at A are used to look up a destination queue identifier in B. This queue 
identifier is then used at C to access a table of queue state in D, thus enabling the packet to be appropriately 
enqueued in E. Subsequent de-queueing by F must also access the queue state associated with each packet when 
it is served. 

There is a close relationship between the information held in B and the organisation of state in D. In the context 
of virtual queueing these relationships are more specifically: 

• Lookup entries in B must be created when packets with unrecognised stream labels arrive at A. These 
entries must point to an available Q-state entry in D. 

• Based on accesses from both C and F, D must be capable of determining (a) whether a queue is empty, 
and (b) whether the queue is eligible to be de-assigned and returned to the pool in B. 

• When D de-assigns a queue, then the related entry in B must be removed. 

The implementation of these behaviours should address the problem of pipelining effects caused by packet/message 
buffering within all links shown in the figure, i.e. packets of a newly assigned/deassigned queue can persist 
in the path between A and F. The virtual queueing solution must also impose minimal overhead on the system 
in terms of logical complexity, messaging bandwidth and the storage of additional state. 

7.3 Summary of the invention 

A low overhead method for setting up and tearing down virtual queues. 

7.4 List of attached figures 



Figure 7.1 Simple schematic view of traffic management 

Figure 7.2 Queue lookup table 



7.5 Detailed description 

Introduction 



The conventional solution might deassign a virtual queue from an inactive connection, wait for any residual packets for 
that connection to be purged from the queues, and then reassign the queue to a new connection. This is possible 
but requires signalling, additional state and synchronisation across the system. Overlapped 
virtual queueing eliminates the purge phase and makes deassignment and reassignment simultaneous. 

Instead of a queue being either assigned or unassigned to a connection, it is either assigned or pending 
re-assignment. Only before its initial assignment may a queue actually be unassigned. This means that a queue 
normally always belongs to a connection. 

In the pending re-assignment state the queue may still be used by the old connection. 

At the moment of reassignment a new connection takes ownership of the queue and the old connection may no 
longer place any further packets in it. The old connection will be granted a new queue as and when further packets 
arrive. 

Deassignment is thus implicit in reassignment - there is no explicit messaging involved. 

Instead of purging the residual packets of the old connection from the many places that they could be hidden in 
the traffic handling pipeline, those packets are left alone and are effectively inserted at the start of the new queue. 
The handover is managed in a controlled fashion to minimise the disruption to the new connection, with old 
packets perhaps being prevented by software from causing certain disruptive state modifications when they are 
dequeued. 



Description of the concept and invention 

Queue lookup: 

As queues are dynamically assigned, packets must be tagged with an identifier which identifies them with a spe- 
cific traffic stream, connection or class of service. This tag enables the packet to be mapped to an assigned queue 
through the use of a lookup table. A table with behaviours which specifically support overlapped virtual queueing 
is illustrated in figure 7.2 and described as follows. 

The table can be indexed in one of two ways. 

1. An entry is identified by content addressing using the key. 

2. The value (of length N bits) can be used to directly index the table of 2^N entries. 

The required behaviour uses these addressing modes as follows: 

• A key is presented to the table in order to look up a value + additional data. An exact match is required. 

• A value is attached to each key used for the lookup. If the lookup is successful then the 
attached value is returned with the results. 

• If the lookup fails then a new entry is created in the table. The key and a default value for the additional 
data are inserted into the table at the entry indexed by the attached value. The lookup returns data from 
this new table entry and a NULL value in place of the attached value. 

• The table must be subsequently accessed (indexed by either key or value) so that the additional data field 
can be updated. Until this happens, safe default values for the various QoS parameters are placed 
in the additional data field. 

Entry creation which overwrites an existing entry is a required behaviour of the overlapped virtual queueing sys- 
tem. Overwriting is the mechanism of simultaneous queue de-assignment and re-assignment. 

A pool of the identities (value) of queues which are pending de-assignment is maintained locally by the queue 
lookup block. As described, values are taken from this pool to append to each lookup. Unused values are returned 
to the pool when the result is returned. 
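
The lookup behaviour described above can be sketched in software as follows. This is an illustration under stated assumptions rather than the embodiment: the class `QueueLookup`, the `index_of` dictionary (standing in for content-addressed matching) and the `is_new` return flag (standing in for the NULL returned in place of the attached value) are all hypothetical. The sketch also assumes that every queue starts in the pool, and it does not guard against the same queue being marked pending twice.

```python
from collections import deque


class QueueLookup:
    def __init__(self, n_bits):
        size = 1 << n_bits                  # table of 2^N entries
        self.keys = [None] * size           # entry index == value == queue identity
        self.data = [None] * size           # additional data (e.g. QoS parameters)
        self.index_of = {}                  # key -> index, models the content match
        self.pending = deque(range(size))   # queues available for (re)assignment

    def mark_pending(self, queue_id):
        """Called by queue monitoring when a queue becomes eligible for re-assignment."""
        self.pending.append(queue_id)

    def lookup(self, key, default_data=None):
        """Return (queue_id, data, is_new)."""
        # A value from the pool is attached to every lookup, if one is available.
        attached = self.pending.popleft() if self.pending else None
        idx = self.index_of.get(key)
        if idx is not None:
            # Hit: the attached value is unused and goes back to the pool.
            if attached is not None:
                self.pending.appendleft(attached)
            return idx, self.data[idx], False
        if attached is None:
            raise RuntimeError("no queue is pending re-assignment")
        # Miss: overwrite the entry indexed by the attached value. Overwriting
        # de-assigns the old connection and assigns the new one in a single step.
        old_key = self.keys[attached]
        if old_key is not None:
            del self.index_of[old_key]
        self.keys[attached] = key
        self.data[attached] = default_data
        self.index_of[key] = attached
        return attached, default_data, True
```

Note that a queue marked as pending continues to serve lookups for its old key until a new key actually claims it, so no explicit de-assignment message is ever exchanged.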

State management: 

In order to support the queue lookup block, an additional function is required which can monitor the activity of 
queues, deassign idle queues, and pass their identities to the queue lookup's pool. 

This function is most obviously associated with the dequeueing processor (or control logic) as it is during 
dequeueing that an empty queue can be detected. The method of idle queue detection is not the primary focus of this 
invention and is not described further here. A solution to queue monitoring and deassignment is described in 
the embodiment section next. 

What matters is how the overlap between old and new connections is managed. A key control flag is carried with 
each packet, referred to as Connection_status. Between queue identification and enqueueing, this flag identifies 
the first packet of any new connection. Between enqueueing and dequeueing this flag identifies packets belonging 
to a queue which is pending de-assignment. This information is essential in managing the smooth handover of a 
queue in a robust manner. For instance, Connection_status could be used in the following ways: 

1. Differentiating between the first packet of a new connection and residual packets of an old connection 
which are buffered between the queue lookup and enqueueing logic at the time of reassignment. When the 
queue monitoring function re-categorises an idle queue as 'pending deassignment' it must mark that 
queue's state. Some time later, the first packet of a new connection will arrive with the Connection_status 
flag set. This provides two clear reference points for the queue monitoring and enqueueing logic to work 
from. In the limbo period in-between, any function acting on the queue state can recognise 
and conditionally handle any packet belonging to the old connection. For instance, the queue monitoring 
function may ignore queues which are pending reassignment. The enqueueing function may clear the 
'pending' status of the state only when it sees the Connection_status bit set. The status is unaffected by 
residual packets of the old connection. 

2. Differentiating between packets of the new connection and residual packets of the old connection which 
are backlogged in the queue structure. It is possible that residual packets which are backlogged in the 
queue memory later upset the Queue State when they are scheduled and the dequeueing function sends a 
state update. Again, the queue state update function may be configured to deal subtly differently with old 
and new packets. 
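
The two uses of Connection_status listed above can be sketched together. This is one possible reading, not the embodiment: `QueueState`, `mark_idle`, `enqueue`, `dequeue` and the `served` counter are hypothetical names, and the sketch assumes the pipeline delivers packets in order, so residual packets of the old connection reach the enqueueing logic before the first packet of the new connection.

```python
from dataclasses import dataclass, field


@dataclass
class QueueState:
    pending: bool = False                         # queue is 'pending deassignment'
    packets: list = field(default_factory=list)   # (packet, residual) pairs


def mark_idle(state):
    """Queue monitoring: mark an empty queue as pending; ignore queues already pending."""
    if not state.pending and not state.packets:
        state.pending = True
        return True
    return False


def enqueue(state, packet, connection_status):
    """Only the first packet of a new connection clears the 'pending' status."""
    if connection_status:
        state.pending = False
    # After enqueueing, the flag carried with the packet records whether the
    # packet belongs to a queue that is still pending de-assignment.
    state.packets.append((packet, state.pending))


def dequeue(state, stats):
    packet, residual = state.packets.pop(0)
    if not residual:
        # Hypothetical state update, suppressed for residual old packets.
        stats["served"] = stats.get("served", 0) + 1
    return packet
```

In this sketch the residual packet is served ahead of the new connection's packets, so it is effectively inserted at the start of the new queue, yet it leaves the new connection's state untouched.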

Finally, note that a 'long term idle' stream may spring back into life in this way after its queue has been marked for 
deassignment. From the moment the queue is reassigned to a different stream, the old stream is simply assigned 
a new virtual queue. There is thus almost no overhead to virtual queue management and queue switching. 

Details of the embodiment 

The implementation of overlapped virtual queueing is described in detail in reference [1]. 

For the queue assignment: 

• Content Addressable memory is an ideal memory technology for the table memory. 

• The memory is supported by a function which provides and recycles the 'free queue' identities. 

For the state management: 

• A state engine is an ideal basis for the queue monitoring function. A background system function can be 
defined which scans through all queue state in the state elements, looking for idle queues. The 'find free 
queue' algorithm is described further in the state engine patent (?) 

Reference material 

[1] A. Spencer "Traffic Handler architecture document - V0.1" 

- Architecture proposal for ClearSpeed 40G Traffic Handler (in progress) 

7.6 Key features of the invention 

• After initial assignment, a queue always belongs to a connection. It is in one of two states: 
assigned or pending re-assignment. 

• In the pending reassignment state it may still be used by the old connection. Use by newly arriving packets 
belonging to the old connection is terminated at the moment of reassignment. 

• Deassignment is implicit in reassignment - there is no explicit messaging involved. 

• After reassignment, instead of purging the residual packets of the old connection, those packets are 
effectively inserted at the start of the new queue. The handover is managed in a controlled fashion to minimise 
the disruption to the new connection. 

• Queues are only deassigned when they are both empty AND there is a demand for resource. There is no 
measured timeout period. 

7.7 Scope of the claim 



This proposal is only relevant to Traffic Handling applications in which many (hundreds of thousands of) queues 
are required. Specifically, this therefore relates to per-flow Traffic Handling. 

Company Confidential 
Confidentiality Level: RED 
© Copyright ClearSpeed Technology Ltd 2001 
Traffic handler patent summary 